Trigger Word Detection

Welcome to the final programming assignment of this specialization!

In this week's videos, you learned about applying deep learning to speech recognition. In this assignment, you will construct a speech dataset and implement an algorithm for trigger word detection (sometimes also called keyword detection, or wake word detection).

In this assignment you will learn to:

* Structure a speech recognition project
* Synthesize and process audio recordings to create train/dev datasets
* Train a trigger word detection model and make predictions

Let's get started! Run the following cell to load the packages you are going to use.

1 - Data synthesis: Creating a speech dataset

Let's start by building a dataset for your trigger word detection algorithm.

1.1 - Listening to the data

Run the cells below to listen to some examples.

You will use these three types of recordings (positives/negatives/backgrounds) to create a labeled dataset.

1.2 - From audio recordings to spectrograms

What really is an audio recording?

Spectrogram

Let's look at an example.

The graph above represents how active each frequency is (y axis) over a number of time-steps (x axis).

**Figure 1**: Spectrogram of an audio recording

Now, you can define:

* $T_x = 5511$: the number of time steps input to the model from the spectrogram
* $n_{freq} = 101$: the number of frequencies input to the model at each time step

Dividing into time-intervals

Note that we may divide a 10-second interval of time into different numbers of units (steps): for example, the spectrogram uses $T_x = 5511$ steps, the model's output uses $T_y = 1375$ steps, and pydub (introduced below) measures the same clip in 10,000 milliseconds.
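
As a rough sketch of these different scales (the 44.1 kHz sample rate for the raw audio is an assumption; the spectrogram and output step counts come from the model shapes shown later in this notebook):

# The same 10-second clip measured in different "step" units.
clip_seconds = 10

raw_audio_steps   = clip_seconds * 44100   # raw samples, assuming a 44.1 kHz recording
pydub_steps       = clip_seconds * 1000    # pydub measures time in milliseconds
spectrogram_steps = 5511                   # Tx: spectrogram time steps fed to the model
output_steps      = 1375                   # Ty: time steps in the model's output

print(raw_audio_steps, pydub_steps, spectrogram_steps, output_steps)
# 441000 10000 5511 1375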

1.3 - Generating a single training example

Benefits of synthesizing data

Because speech data is hard to acquire and label, you will synthesize your training data using the audio clips of activates, negatives, and backgrounds.

Process for Synthesizing an audio clip

Pydub

Overlaying positive/negative 'word' audio clips on top of the background audio
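
As a minimal illustration of how an overlay works in pydub (the file names below are hypothetical, and pydub positions are given in milliseconds):

from pydub import AudioSegment

# Load a 10-second background and a short word clip (hypothetical file names).
background = AudioSegment.from_wav("background.wav")
activate = AudioSegment.from_wav("activate.wav")

# Overlay the word clip on top of the background, starting 2,000 ms into the clip.
combined = background.overlay(activate, position=2000)
combined.export("example_overlay.wav", format="wav")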

Label the positive/negative words

Example

Synthesized data is easier to label

Visualizing the labels

**Figure 2**: Labels for a synthesized audio clip

Helper functions

To implement the training set synthesis process, you will use the following helper functions.

  1. get_random_time_segment(segment_ms)
    • Retrieves a random time segment from the background audio.
  2. is_overlapping(segment_time, existing_segments)
    • Checks if a time segment overlaps with existing segments
  3. insert_audio_clip(background, audio_clip, existing_times)
    • Inserts an audio segment at a random time in the background audio
    • Uses the functions get_random_time_segment and is_overlapping
  4. insert_ones(y, segment_end_ms)
    • Inserts additional 1's into the label vector y after the word "activate"

Get a random time segment
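
A minimal sketch of what this helper can look like, assuming a 10,000 ms background clip:

import numpy as np

def get_random_time_segment(segment_ms):
    """Return a random (start, end) segment of length segment_ms inside a 10,000 ms background."""
    segment_start = np.random.randint(low=0, high=10000 - segment_ms)  # ensure the segment fits
    segment_end = segment_start + segment_ms - 1
    return (segment_start, segment_end)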

Check if audio clips are overlapping

Exercise: Implement is_overlapping().

  1. Create a "False" flag that you will later set to "True" if you find an overlap.
  2. Loop over the start and end times in previous_segments and compare them to the new segment's start and end times. If there is an overlap, set the flag defined in (1) to True.

You can use:

for ....:
    if ... <= ... and ... >= ...:
        ...

Hint: There is overlap if the new segment starts before a previous segment ends, and the new segment ends after that previous segment starts.
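
One possible solution, consistent with the hint above:

def is_overlapping(segment_time, previous_segments):
    """Return True if segment_time overlaps any (start, end) pair in previous_segments."""
    segment_start, segment_end = segment_time
    overlap = False  # step 1: the flag starts as False
    # Step 2: compare against every existing segment.
    for previous_start, previous_end in previous_segments:
        if segment_start <= previous_end and segment_end >= previous_start:
            overlap = True
    return overlap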

Expected Output:

**Overlap 1** False
**Overlap 2** True

Insert audio clip

Exercise: Implement insert_audio_clip().

  1. Get the length of the audio clip that is to be inserted.
    • Get a random time segment whose duration equals the duration of the audio clip that is to be inserted.
  2. Make sure that the time segment does not overlap with any of the previous time segments.
    • If it is overlapping, then go back to step 1 and pick a new time segment.
  3. Append the new time segment to the list of existing time segments.
    • This keeps track of all the segments you've inserted.
  4. Overlay the audio clip over the background using pydub. We have implemented this for you.
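
A sketch that follows the four steps above, using the helpers defined earlier (one possible implementation, not the only one):

def insert_audio_clip(background, audio_clip, previous_segments):
    """Overlay audio_clip at a random, non-overlapping position in background."""
    segment_ms = len(audio_clip)  # pydub reports clip length in ms

    # Steps 1-2: pick a random segment, retrying until it does not overlap.
    segment_time = get_random_time_segment(segment_ms)
    while is_overlapping(segment_time, previous_segments):
        segment_time = get_random_time_segment(segment_ms)

    # Step 3: remember this segment so later insertions avoid it.
    previous_segments.append(segment_time)

    # Step 4: overlay the clip onto the background.
    new_background = background.overlay(audio_clip, position=segment_time[0])
    return new_background, segment_time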

Expected Output

**Segment Time** (2254, 3169)

Insert ones for the labels of the positive target

Exercise: Implement insert_ones().
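
A sketch of insert_ones, assuming $T_y = 1375$ output steps over a 10,000 ms clip, and that the 50 labels following the end of "activate" are set to 1 (50 is used here as an assumption):

Ty = 1375  # number of time steps in the model's output

def insert_ones(y, segment_end_ms):
    """Set the 50 labels following the end of an "activate" segment to 1."""
    # Convert the end time from milliseconds to an index on the output scale.
    segment_end_y = int(segment_end_ms * Ty / 10000.0)
    for i in range(segment_end_y + 1, segment_end_y + 51):
        if i < Ty:
            y[0, i] = 1
    return y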

Expected Output

**sanity checks**: 0.0 1.0 0.0

Creating a training example

Finally, you can use insert_audio_clip and insert_ones to create a new training example.

Exercise: Implement create_training_example(). You will need to carry out the following steps:

  1. Initialize the label vector $y$ as a numpy array of zeros and shape $(1, T_y)$.
  2. Initialize the set of existing segments to an empty list.
  3. Randomly select 0 to 4 "activate" audio clips, and insert them onto the 10-second clip. Also insert labels at the correct positions in the label vector $y$.
  4. Randomly select 0 to 2 negative audio clips, and insert them onto the 10-second clip.
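
A sketch that follows the four steps above, using $T_y$ and the helpers sketched earlier. It assumes the notebook's graph_spectrogram helper (which returns the spectrogram of a wav file) is available, and treats the 20 dB background reduction and the export file name as incidental choices:

import numpy as np

def create_training_example(background, activates, negatives):
    """Synthesize one training example (spectrogram x, labels y) from a background and word clips."""
    background = background - 20          # make the background quieter (assumption)
    y = np.zeros((1, Ty))                 # step 1: initialize the label vector
    previous_segments = []                # step 2: no segments inserted yet

    # Step 3: insert 0-4 "activate" clips and label the steps after each one.
    number_of_activates = np.random.randint(0, 5)
    for index in np.random.randint(len(activates), size=number_of_activates):
        background, (_, segment_end) = insert_audio_clip(background, activates[index], previous_segments)
        y = insert_ones(y, segment_end_ms=segment_end)

    # Step 4: insert 0-2 negative clips (no labels to add).
    number_of_negatives = np.random.randint(0, 3)
    for index in np.random.randint(len(negatives), size=number_of_negatives):
        background, _ = insert_audio_clip(background, negatives[index], previous_segments)

    # Export the synthesized audio and compute its spectrogram.
    background.export("train.wav", format="wav")
    x = graph_spectrogram("train.wav")
    return x, y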

Expected Output

Now you can listen to the training example you created and compare it to the spectrogram generated above.

Expected Output

Finally, you can plot the associated labels for the generated training example.

Expected Output

1.4 - Full training set

1.5 - Development set

2 - Model

2.1 - Build the model

Our goal is to build a network that will ingest a spectrogram and output a signal when it detects the trigger word. This network will use 4 layers:

* A convolutional layer
* Two GRU layers
* A dense layer. 

Here is the architecture we will use.

**Figure 3**: Model architecture

1D convolutional layer

One key layer of this model is the 1D convolutional step (near the bottom of Figure 3).

GRU, dense and sigmoid

Unidirectional RNN

Implement the model

Implementing the model can be done in four steps:

Step 1: CONV layer. Use Conv1D() to implement this, with 196 filters, a filter size of 15 (kernel_size=15), and a stride of 4.

output_x = Conv1D(filters=..., kernel_size=..., strides=...)(input_x)
output_x = Activation("...")(output_x)
output_x = Dropout(rate=...)(output_x)

Step 2: First GRU layer. To generate the GRU layer, use 128 units.

output_x = GRU(units=..., return_sequences = ...)(input_x)

Step 3: Second GRU layer. This has the same specifications as the first GRU layer.

Step 4: Create a time-distributed dense layer as follows:

X = TimeDistributed(Dense(1, activation = "sigmoid"))(X)

This creates a dense layer followed by a sigmoid, so that the parameters used for the dense layer are the same for every time step.

Exercise: Implement model(), the architecture is presented in Figure 3.
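
A sketch of one architecture that matches Figure 3 and the parameter counts below. The batch normalization layers and the 0.8 dropout rate are assumptions, though batch norm is consistent with the 904 non-trainable parameters in the expected output:

from keras.models import Model
from keras.layers import (Input, Conv1D, BatchNormalization, Activation,
                          Dropout, GRU, TimeDistributed, Dense)

def model(input_shape):
    """Trigger word detection model: Conv1D -> GRU -> GRU -> time-distributed sigmoid."""
    X_input = Input(shape=input_shape)

    # Step 1: CONV layer (196 filters, kernel size 15, stride 4)
    X = Conv1D(196, kernel_size=15, strides=4)(X_input)
    X = BatchNormalization()(X)
    X = Activation("relu")(X)
    X = Dropout(0.8)(X)

    # Step 2: first GRU layer (128 units, keep the full sequence)
    X = GRU(units=128, return_sequences=True)(X)
    X = Dropout(0.8)(X)
    X = BatchNormalization()(X)

    # Step 3: second GRU layer (same specification)
    X = GRU(units=128, return_sequences=True)(X)
    X = Dropout(0.8)(X)
    X = BatchNormalization()(X)
    X = Dropout(0.8)(X)

    # Step 4: time-distributed dense layer with a sigmoid output per time step
    X = TimeDistributed(Dense(1, activation="sigmoid"))(X)

    return Model(inputs=X_input, outputs=X)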

Let's print the model summary to keep track of the shapes.

Expected Output:

**Total params** 522,561
**Trainable params** 521,657
**Non-trainable params** 904

The output of the network is of shape (None, 1375, 1) while the input is (None, 5511, 101). The Conv1D has reduced the number of steps from 5511 to 1375.
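
The step count follows from the usual output-length formula for a "valid" 1D convolution:

# floor((n_in - kernel_size) / stride) + 1
n_in, kernel_size, stride = 5511, 15, 4
n_out = (n_in - kernel_size) // stride + 1
print(n_out)  # 1375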

2.2 - Fit the model

You can train the model further, using the Adam optimizer and binary cross entropy loss, as follows. This will run quickly because we are training just for one epoch and with a small training set of 26 examples.
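
A sketch of the compile/fit step (the learning-rate schedule, batch size, and the array names X and Y are assumptions; the argument names lr and decay follow the older Keras API):

from keras.optimizers import Adam

opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)  # hyperparameter values assumed
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])
model.fit(X, Y, batch_size=5, epochs=1)  # X, Y: the 26-example training set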

2.3 - Test the model

Finally, let's see how your model performs on the dev set.
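
For example (the dev-set array names are assumptions):

loss, acc = model.evaluate(X_dev, Y_dev)  # X_dev, Y_dev: dev-set spectrograms and labels
print("Dev set accuracy =", acc)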

This looks pretty good!

3 - Making Predictions

Now that you have built a working model for trigger word detection, let's use it to make predictions. This code snippet runs audio (saved in a wav file) through the network.
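
A sketch of that prediction step, assuming the notebook's graph_spectrogram helper is available:

import numpy as np

def detect_triggerword(filename):
    """Run a wav file through the trained model and return per-time-step probabilities."""
    x = graph_spectrogram(filename)   # shape (n_freq, Tx)
    x = x.swapaxes(0, 1)              # model expects (Tx, n_freq)
    x = np.expand_dims(x, axis=0)     # add the batch dimension
    predictions = model.predict(x)    # shape (1, Ty, 1)
    return predictions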

Insert a chime to acknowledge the "activate" trigger
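
A sketch of the chiming step with pydub. The chime file path and the 75-step spacing rule are assumptions; the spacing keeps a single "activate" from triggering several chimes:

from pydub import AudioSegment

chime_file = "chime.wav"  # hypothetical path to a short chime sound

def chime_on_activate(filename, predictions, threshold):
    """Overlay a chime wherever the model's output stays above the threshold."""
    audio_clip = AudioSegment.from_wav(filename)
    chime = AudioSegment.from_wav(chime_file)
    Ty = predictions.shape[1]
    consecutive_timesteps = 0
    for i in range(Ty):
        consecutive_timesteps += 1
        if predictions[0, i, 0] > threshold and consecutive_timesteps > 75:
            # Place the chime at the corresponding position in the audio (in ms).
            position_ms = (i / Ty) * audio_clip.duration_seconds * 1000
            audio_clip = audio_clip.overlay(chime, position=position_ms)
            consecutive_timesteps = 0
    audio_clip.export("chime_output.wav", format="wav")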

3.3 - Test on dev examples

Let's explore how our model performs on two unseen audio clips from the development set. Let's first listen to the two dev set clips.

Now let's run the model on these audio clips and see if it adds a chime after "activate"!

Congratulations

You've come to the end of this assignment!

Here's what you should remember:

* Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.
* Using a spectrogram (and optionally a 1D conv layer) is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.
* An end-to-end deep learning approach can be used to build a very effective trigger word detection system.

Congratulations on finishing the final assignment!

Thank you for sticking with us through the end and for all the hard work you've put into learning deep learning. We hope you have enjoyed the course!

4 - Try your own example! (OPTIONAL/UNGRADED)

In this optional and ungraded portion of this notebook, you can try your model on your own audio clips!

Once you've uploaded your audio file to Coursera, put the path to your file in the variable below.

Finally, use the model to predict when you say "activate" in the 10-second audio clip, and trigger a chime. If beeps are not being added appropriately, try adjusting the chime_threshold.